Prosper Peer-to-peer Lending Platform
This dataset contains a list of loans of Prosper Peer-to-peer Lending Platform from 2005 to 2016 with 81 variables in total. I have chosen this dataset because it contains various variables related to peer to peer lending that can be explored to understand the basic purpose of getting P2P loan, overall customer behaviour in the P2P platform.
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
With so many variables in the datasets, I need to first understand the meaning of the variables based on the given variable definitions.
Because it is a loan data, some of the initial thoughts I have is to analyze on variables related to income, profession / employment, loan amount, credit info of the borrower.
As such, I have shortlisted 18 variables for the analysis in this project: EmploymentStatusDuration, ListingCategory..numeric., CreditScoreRangeLower, CurrentCreditLines, DebtToIncomeRatio, StatedMonthlyIncome, Occupation, BorrowerState, EmploymentStatus, IsBorrowerHomeowner, IncomeRange, IncomeVerifiable,BorrowerRate, Term, LoanStatus, LoanOriginalAmount, LoanOriginationDate, ProsperRating..Alpha.
## EmploymentStatusDuration ListingCategory..numeric. CreditScoreRangeLower
## Min. : 0.00 Min. : 0.000 Min. : 0.0
## 1st Qu.: 26.00 1st Qu.: 1.000 1st Qu.:660.0
## Median : 67.00 Median : 1.000 Median :680.0
## Mean : 96.07 Mean : 2.774 Mean :685.6
## 3rd Qu.:137.00 3rd Qu.: 3.000 3rd Qu.:720.0
## Max. :755.00 Max. :20.000 Max. :880.0
## NA's :7625 NA's :591
## CurrentCreditLines DebtToIncomeRatio StatedMonthlyIncome
## Min. : 0.00 Min. : 0.000 Min. : 0
## 1st Qu.: 7.00 1st Qu.: 0.140 1st Qu.: 3200
## Median :10.00 Median : 0.220 Median : 4667
## Mean :10.32 Mean : 0.276 Mean : 5608
## 3rd Qu.:13.00 3rd Qu.: 0.320 3rd Qu.: 6825
## Max. :59.00 Max. :10.010 Max. :1750003
## NA's :7604 NA's :8554
## Occupation BorrowerState EmploymentStatus
## Other :28617 CA :14717 Employed :67322
## Professional :13628 TX : 6842 Full-time :26355
## Computer Programmer : 4478 NY : 6729 Self-employed: 6134
## Executive : 4311 FL : 6720 Not available: 5347
## Teacher : 3759 IL : 5921 Other : 3806
## Administrative Assistant: 3688 : 5515 : 2255
## (Other) :55456 (Other):67493 (Other) : 2718
## IsBorrowerHomeowner IncomeRange IncomeVerifiable
## False:56459 $25,000-49,999:32192 False: 8669
## True :57478 $50,000-74,999:31050 True :105268
## $100,000+ :17337
## $75,000-99,999:16916
## Not displayed : 7741
## $1-24,999 : 7274
## (Other) : 1427
## BorrowerRate Term LoanStatus
## Min. :0.0000 Min. :12.00 Current :56576
## 1st Qu.:0.1340 1st Qu.:36.00 Completed :38074
## Median :0.1840 Median :36.00 Chargedoff :11992
## Mean :0.1928 Mean :40.83 Defaulted : 5018
## 3rd Qu.:0.2500 3rd Qu.:36.00 Past Due (1-15 days) : 806
## Max. :0.4975 Max. :60.00 Past Due (31-60 days): 363
## (Other) : 1108
## LoanOriginalAmount LoanOriginationDate ProsperRating..Alpha.
## Min. : 1000 2014-01-22 00:00:00: 491 :29084
## 1st Qu.: 4000 2013-11-13 00:00:00: 490 C :18345
## Median : 6500 2014-02-19 00:00:00: 439 B :15581
## Mean : 8337 2013-10-16 00:00:00: 434 A :14551
## 3rd Qu.:12000 2014-01-28 00:00:00: 339 D :14274
## Max. :35000 2013-09-24 00:00:00: 316 E : 9795
## (Other) :111428 (Other):12307
## MonthlyLoanPayment
## Min. : 0.0
## 1st Qu.: 131.6
## Median : 217.7
## Mean : 272.5
## 3rd Qu.: 371.6
## Max. :2251.5
##
## 'data.frame': 113937 obs. of 19 variables:
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ ListingCategory..numeric.: int 0 2 0 16 2 1 1 2 7 7 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## [1] "EmploymentStatusDuration" "ListingCategory..numeric."
## [3] "CreditScoreRangeLower" "CurrentCreditLines"
## [5] "DebtToIncomeRatio" "StatedMonthlyIncome"
## [7] "Occupation" "BorrowerState"
## [9] "EmploymentStatus" "IsBorrowerHomeowner"
## [11] "IncomeRange" "IncomeVerifiable"
## [13] "BorrowerRate" "Term"
## [15] "LoanStatus" "LoanOriginalAmount"
## [17] "LoanOriginationDate" "ProsperRating..Alpha."
## [19] "MonthlyLoanPayment"
There is a good mix of both discrete and continuous variables, which can be further explored in the univariate plot section below.
This section analyzes the distribution and characteristics of single variables that were selected above.
Here are some custom functions to help summarize the information of variables that will be used later.
First thing that comes to my mind when we talk about loan is income. So, let’s first look at the stated monthly income of users.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750000
Because the stated monthly income is extremely skewed in the positive side (max income = 1750000), we need to limit the x-axis to display a more observable histogram distribution. After limiting the x-axis to 20000, We can see that the common range of stated monthly income falls between 3000 to 6000, which is in line with the IQR distribution shown.
From the boxplot above, we can see that there are outstandingly monthly incomes from $12k and above. The median monthly income is around $4.5k, and the mean loan is higher than the median, suggesting a right skewed distribution.
Next, we will analyze the annual income range of users.
## freq percentage
## $0 621 0.5450380
## $1-24,999 7274 6.3842299
## $100,000+ 17337 15.2163037
## $25,000-49,999 32192 28.2542107
## $50,000-74,999 31050 27.2519024
## $75,000-99,999 16916 14.8468013
## Not displayed 7741 6.7941055
## Not employed 806 0.7074085
The annual income range tallies with the stated monthly income. In this case, 55% of the users are in the income range of $25K to $75K. Most of the users (>90%) have stated their income range, except for the category of $0, “Not displayed”, “Not employed”.
Next, we may be interested to know whether the stated income range is already verified or not.
## freq percentage
## False 8669 7.608591
## True 105268 92.391409
Fortunately, most of the users (>90%) have their income verified in the Prosper platform.
After understanding the distribution of income-related variables, we will move on to analyze the occupation of users.
## Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## freq percentage
## Other 28617 25.11651176
## Professional 13628 11.96099599
## Computer Programmer 4478 3.93024215
## Executive 4311 3.78366992
## Teacher 3759 3.29919166
## Administrative Assistant 3688 3.23687652
## Analyst 3602 3.16139621
## 3588 3.14910872
## Sales - Commission 3446 3.02447844
## Accountant/CPA 3233 2.83753302
## Clerical 3164 2.77697324
## Sales - Retail 2797 2.45486541
## Skilled Labor 2746 2.41010383
## Retail Management 2602 2.28371820
## Nurse (RN) 2489 2.18454058
## Construction 1790 1.57104365
## Truck Driver 1675 1.47011068
## Laborer 1595 1.39989643
## Police Officer/Correction Officer 1578 1.38497591
## Civil Service 1457 1.27877687
## Engineer - Mechanical 1406 1.23401529
## Military Enlisted 1272 1.11640644
## Food Service Management 1239 1.08744306
## Engineer - Electrical 1125 0.98738777
## Food Service 1123 0.98563241
## Medical Technician 1117 0.98036634
## Attorney 1046 0.91805120
## Tradesman - Mechanic 951 0.83467179
## Social Worker 741 0.65035941
## Postal Service 627 0.55030412
## Professor 557 0.48886665
## Realtor 543 0.47657916
## Doctor 494 0.43357294
## Nurse (LPN) 492 0.43181758
## Nurse's Aide 491 0.43093991
## Tradesman - Electrician 477 0.41865241
## Waiter/Waitress 436 0.38266761
## Fireman 422 0.37038012
## Scientist 372 0.32649622
## Military Officer 346 0.30367659
## Bus Driver 316 0.27734625
## Principal 312 0.27383554
## Teacher's Aide 276 0.24223913
## Pharmacist 257 0.22556325
## Student - College Graduate Student 245 0.21503111
## Landscaping 236 0.20713201
## Engineer - Chemical 225 0.19747755
## Investor 214 0.18782310
## Architect 213 0.18694542
## Pilot - Private/Commercial 199 0.17465792
## Clergy 196 0.17202489
## Student - College Senior 188 0.16500347
## Car Dealer 180 0.15798204
## Chemist 145 0.12726331
## Psychologist 145 0.12726331
## Biologist 125 0.10970975
## Religious 124 0.10883207
## Flight Attendant 123 0.10795440
## Homemaker 120 0.10532136
## Tradesman - Carpenter 120 0.10532136
## Student - College Junior 112 0.09829994
## Tradesman - Plumber 102 0.08952316
## Student - College Sophomore 69 0.06055978
## Dentist 68 0.05968211
## Student - College Freshman 41 0.03598480
## Student - Community College 28 0.02457498
## Judge 22 0.01930892
## Student - Technical School 16 0.01404285
There are many categories (68) in occupation. As such, to visualize it in a bar chart, we will need to: - First, subset the occupation data to exclude ‘Other’ and ‘
Afterwards, we are able to visually comprehend the distribution of occupations among users.
From the summary table and the bar chart, we can observe that: - At least a quarter of the users do not wish to indicate their occupation. This is shown by occupation of ‘Other’ (>25%) and ‘
Next, we will analyze Loan Original Amount variable.
From the histogram above, We can see that there are some prominent peaks for original amount of loan at 4000, 10000, 15000. Some minor peaks include 2000, 3000, 5000. Interestingly, for higher amount of loan, 20000 and 25000 are the most common ones.
From the boxplot above, we can see that thare are outstandingly high loan amount from 22500 and above. The median loan is around $6k, and the mean loan is around $9k, suggesting a right skewed distribution.
Next, we will look at loan term variable.
## freq percentage
## 12 1614 1.416572
## 36 87778 77.040821
## 60 24545 21.542607
77% of the users have a loan term of 36 months (3 years), followed by 60 months and 12 months.
## freq percentage
## Cancelled 5 0.00438839
## Chargedoff 11992 10.52511476
## Completed 38074 33.41671274
## Current 56576 49.65551138
## Defaulted 5018 4.40418828
## FinalPaymentInProgress 205 0.17992399
## Past Due (>120 days) 16 0.01404285
## Past Due (1-15 days) 806 0.70740848
## Past Due (16-30 days) 265 0.23258467
## Past Due (31-60 days) 363 0.31859712
## Past Due (61-90 days) 313 0.27471322
## Past Due (91-120 days) 304 0.26681412
Fortunately, around 82% of the users have “Current” or “Completed” loan status. These two statuses can be considered as positive categories of loan status because they mean that the users have either completed their loan payment or on track in making their loan payment.
On the flip side, around 15% of the users have “Chargedoff” or “Defaulted” status. These two statuses can be considered as negative statuses because they mean that the users are not able to make the loan payment and defaulted their loan.
## freq percentage
## 2005 22 0.01930892
## 2006 5906 5.18356636
## 2007 11460 10.05819005
## 2008 11552 10.13893643
## 2009 2047 1.79660690
## 2010 5652 4.96063614
## 2011 11228 9.85456875
## 2012 19553 17.16123823
## 2013 34345 30.14385143
## 2014 12172 10.68309680
## freq percentage
## 1 11395 10.001141
## 2 9728 8.538052
## 3 8555 7.508535
## 4 7559 6.634368
## 5 8500 7.460263
## 6 8847 7.764817
## 7 9154 8.034265
## 8 9592 8.418688
## 9 9221 8.093069
## 10 11043 9.692198
## 11 9635 8.456428
## 12 10708 9.398176
## # A tibble: 10 × 4
## LoanYear LoanTotalSum LoanTotalAvg LoanCount
## <fctr> <int> <dbl> <int>
## 1 2014 144995536 11912.220 12172
## 2 2013 362170278 10545.066 34345
## 3 2012 153175116 7833.842 19553
## 4 2011 75138013 6692.021 11228
## 5 2010 26940486 4766.540 5652
## 6 2009 8914396 4354.859 2047
## 7 2008 69561850 6021.628 11552
## 8 2007 80787786 7049.545 11460
## 9 2006 28132199 4763.325 5906
## 10 2005 78687 3576.682 22
## # A tibble: 12 × 4
## LoanMonth LoanTotalSum LoanTotalAvg LoanCount
## <fctr> <int> <dbl> <int>
## 1 12 96412497 9003.782 10708
## 2 11 83632456 8680.068 9635
## 3 10 92936832 8415.904 11043
## 4 9 74354423 8063.596 9221
## 5 8 72201482 7527.260 9592
## 6 7 70113613 7659.342 9154
## 7 6 66166162 7478.938 8847
## 8 5 64585589 7598.305 8500
## 9 4 58124431 7689.434 7559
## 10 3 71830810 8396.354 8555
## 11 2 91013137 9355.791 9728
## 12 1 108522915 9523.731 11395
From the graphs and tables above, we can see a very different trend between comparison of average and frequency or sum of loan amount by month or year.
For loan amount by month: - Both sum and average of loan amount show similar months in which there are some peaks. However, if we rank it by the measures (either average or sum or freq), the top months will be different: 1) For average loan amount: Jan, Feb, Oct, Nov, Dec have the most average loan amount (ordered from top to bottom). The average loan is generally lower in the month of April to August. 2) For sum or frequency of loan amount: Jan, Oct, Dec, Feb, Nov have the most average loan amount (ordered by top to bottom). There is an obvious dip in sum and frequency of loan amount in April.
For loan amount by year: 1) Generally, the average loan amount is increasing throughout the years, only with some dips in 2008 to 2010. 2) The sum of loan amount peaks in 2013, with similar dips observed in 2008 to 2010. Perhaps, those dips are caused by financial crisis?
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.07 137.00 755.00 7625
EmploymentStatusDuration shows a right-skewed distribution with mean (96 months) > median (67 months). This shows that the employment status of users are more likely to stay unchanged in a shorter duration.
Employed and Full-Time are the most common employment statuses.
## [1] 113937
## # A tibble: 21 × 3
## listingCategory SumDollarTotal CountTotal
## <fctr> <int> <int>
## 1 1 - Debt Consolidation 577736197 58308
## 2 0 - Not Available 106096621 16965
## 3 3 - Business 64175191 7189
## 4 7- Other 62035775 10494
## 5 2 - Home Improvement 60148466 7433
## 6 6 - Auto 12861665 2572
## 7 4 - Personal Loan 10913226 2395
## 8 13 - Household Expenses 10549832 1996
## 9 15 - Medical/Dental 9929936 1522
## 10 14 - Large Purchases 7684187 876
## # ... with 11 more rows
## # A tibble: 21 × 3
## listingCategory AvgDollarTotal CountTotal
## <fctr> <dbl> <int>
## 1 1 - Debt Consolidation 9908.352 58308
## 2 8 - Baby&Adoption 9751.397 199
## 3 3 - Business 8926.859 7189
## 4 20 - Wedding Loans 8835.726 771
## 5 14 - Large Purchases 8771.903 876
## 6 9 - Boat 8734.118 85
## 7 12 - Green Loans 8456.576 59
## 8 17 - RV 8148.769 52
## 9 2 - Home Improvement 8092.085 7433
## 10 11 - Engagement Ring 7636.945 217
## # ... with 11 more rows
## freq percentage
## 1 - Debt Consolidation 58308 51.17564970
## 0 - Not Available 16965 14.88980753
## 7- Other 10494 9.21035309
## 2 - Home Improvement 7433 6.52378069
## 3 - Business 7189 6.30962725
## 6 - Auto 2572 2.25738785
## 4 - Personal Loan 2395 2.10203885
## 13 - Household Expenses 1996 1.75184532
## 15 - Medical/Dental 1522 1.33582594
## 18 - Taxes 885 0.77674504
## 14 - Large Purchases 876 0.76884594
## 20 - Wedding Loans 771 0.67668975
## 19 - Vacation 768 0.67405672
## 5 - Student Use 756 0.66352458
## 16 - Motorcycle 304 0.26681412
## 11 - Engagement Ring 217 0.19045613
## 8 - Baby&Adoption 199 0.17465792
## 10 - Cosmetic Procedure 91 0.07986870
## 9 - Boat 85 0.07460263
## 12 - Green Loans 59 0.05178300
## 17 - RV 52 0.04563926
Based on the results shown above, if we look at the bar chart showing the frequency of listing categories, we can observe that around 51% of the listing categories are of Debt Consolidation. Therefore, it does not really tell us much insight as debt consolidation is a very generic term for loan repayment.
As such, by observing the listing category by average loan amount, we can observe some interesting insights. The top 10 listing categories are: Debt Consolidation, Baby&Adoption, Business, Wedding Loans, Large Purchases, Boat, Green Loans, RV, Home Improvement, Engagement Ring.
Some interesting observations: - Marriage seems to play a huge role in loan: Baby&Adoption, Wedding Loans, Engagement Ring - Home improvement has a higher frequency than Business. - Beyond the top 10, Household Expenses, Auto, Vacation, Medical/Dental have significant frequency number as well.
Combining all the three points above, we can see that most of the users are willing to take loan in category such as family, wedding, housing, health, business, and lifestyle.
From the histogram of Debt to Income Ration (DIR) above, we can see that most of the users have debt that is around 10% to 30% of their income. It makes me wonder who will take loan which is 50% or more above their income? Maybe, this is worth investigating in the multivariate analysis in later part of the project.
Both the boxplot and histogram show that the DIR has a right-skewed distribution with users having debt that is more than their income. The median and mean of the DIR is around 20% and the IQR is 10% to 30% (Borrowers usually have a debt that is 10% to 30% of the income)
the next step is to estimate the monthly debt by multplying the debt to income ratio with the stated monthly income.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 637.5 1062.0 1225.0 1622.0 171000.0 8554
From the distribution of monthly debt above, it can be seen that most of the users have monthly debt around 500 to 1500. This is in line with the debt-to-income ratio, which is around 10% to 30% of the monthly income (around $3000 to $6000) for most of the users.
From the boxplot above, we can see that there are extremely high monthly debt above $3k. The median and mean monthly debt are around $1k.
## num [1:113937] 0.192 0.145 0.0755 0.0925 0.1355 ...
## [1] "double"
## [1] "numeric"
The distribution of borrower rate is somewhat bell-curve like with slightly right-skewed trend and some intermittent spikes throughout the trend. There is a very obvious mega spike at around 31%. After some online research (http://www.lendingmemo.com/rates-fees-lending-club-prosper/),we can understand that users with 31% borrower rate usually falls within Prosper credit rating of E or HR.
## freq percentage
## 29084 25.526387
## C 18345 16.101003
## B 15581 13.675101
## A 14551 12.771093
## D 14274 12.527976
## E 9795 8.596856
## HR 6935 6.086697
## AA 5372 4.714886
Around 25% of the loans are not rated, while the rest of the ratings are almost similar at around 15%, except for the smaller proportion of AA, E, and HR rating.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
## [1] "integer"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
## [1] 640 680 820 740 680 760 540 NA 700 700 540 720 600 680 520 760 580
## [18] NA 580 700
## freq percentage
## 0 133 1.173398e-01
## 360 1 8.822543e-04
## 420 5 4.411272e-03
## 440 36 3.176116e-02
## 460 141 1.243979e-01
## 480 346 3.052600e-01
## 500 554 4.887689e-01
## 520 1593 1.405431e+00
## 540 1474 1.300443e+00
## 560 1357 1.197219e+00
## 580 1125 9.925361e-01
## 600 3602 3.177880e+00
## 620 4172 3.680765e+00
## 640 12199 1.076262e+01
## 660 16366 1.443897e+01
## 680 16492 1.455014e+01
## 700 15471 1.364936e+01
## 720 12923 1.140137e+01
## 740 9267 8.175851e+00
## 760 6606 5.828172e+00
## 780 4624 4.079544e+00
## 800 2644 2.332680e+00
## 820 1409 1.243096e+00
## 840 567 5.002382e-01
## 860 212 1.870379e-01
## 880 27 2.382087e-02
After summarizing CreditScoreRangeLower in frequency table, we can see that the variable is more appropriate to be converted as a discrete variable. As such, a bar chart is chosen over a histogram.
From the bar chart of credit score range lower, we can see that the most common credit score is between 640 to 740.
Most of the users have around 3 to 15 current credit lines with the median around 6 to 8 credit lines.
The top borrower states are California, Texas, New York, Florida, and Illinois
## False True
## 56459 57478
We have almost equal proportion of borrowers who are also homeowners vs borrowers who are not homeowners.
Number of rows in the dataset: 113,937. Out of 81 available variables, I have chosen 18 of them with 11 of them being discrete then the other 8 being continuous.
Continuous Variables: EmploymentStatusDuration CreditScoreRangeLower CurrentCreditLines DebtToIncomeRatio StatedMonthlyIncome BorrowerRate LoanOriginalAmount MonthlyLoanPayment
Discrete Variables: ListingCategory..numeric. BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner IncomeRange IncomeVerifiable Term LoanStatus LoanOriginationDate ProsperRating..Alpha.
In general, most of the loans are in loan terms of 3 years and the loan amount is under $10k with median of around $6.5k. Most of the monthly payments are below $500 with a median of $217 and a median interest rate of 18%.
The most common loan listing category is debt consolidation (51%). Interestingly, if we analyze the loan listing category by the average loan amount, we can see that listing category with the most average loan amount is related to family, wedding, housing, health, business, and lifestyle.
The credit score range of borrwers are around the range of 660 to 720 (CreditScoreRangeLower IQR). The median monthly income of users is 4667 with 28% of them earn $25k to $50k, 27% earn $50k to $75k. Most of the borrowers are employed in one way or another (full-time, employed, self-employed)
Some of the important features are related to income,loan, and credit. For example, credit score, debt to income ratio, monthly/annual income, interest rate and loan term.
Anyone who borrows money will have some objectives in mind. As such, we can look at loan listing category to find out why borrowers decided to take loans, and also analyze loan status to understand whether the loans are being paid regularly.
One new variable (TotalMonthlyDebt) is created by multiplying debt to income ratio and stated monthly income. The distribution of the TotalMonthlyDebt is positively skewed, meaning that more borrowers have a loan amount in the lower end.
Loan Origination Date is further categorized into LoanYear, and LoanMonth. This allows us to analyze the peridical changes of loan in terms of year and month.
ListingCategory is also mapped into the appropriate textual description, so that it is more readable.
Some of the numeric variables are analyzed as a factor. This includes: Term and CreditScoreRangeLower. In this case, CreditScoreRangeLower seems to follow a steady increase of “20” score without any score in between. As such, I analyze it as a discrete factor variable.
First, we will start looking at a correlation matrix of all the chosen variables to have an overview if there is any interesting relationship between two variables.
The size of circle represents the strength of the correlation. The color represents the strength and the direction of the correlation as shown by the legend.
Darker blue represents stronger positive correlation, while darker red represents stronger negative relationship.
Here are some significant correlation observed in the correlation matrix: - IncomeRange_reordered and LoanOriginalAmount - ProsperRating and BorrowerRate - BorrowerRate and CreditRating - TotalMonthlyDebt and CurrentCreditLine - BorrowerRate and CreditScoreRangeLower - StatedMonthlyIncome and TotalMOnthlyDebt - BorrowerRate and LoanOriginalAmount
We will analyze the significant relationships in the further steps.
Initialy we analyzed income range and loan amount separately, it would be interesting to know if borrowers in higher income range take more loan than borrowers in lower income range.
From the boxplot above, we can observe that users with higher income tend to take a higher median loan amount.
From the boxplot above, we can see that AA has the best (lowest) median credit rating, followed by A to E and HR is the worst (highest) credit rating.
## [1] 0.4733034
R^2 for Current Credit Lines vs Total Monthly Debt is 0.47, which indicates a strong positive correlation. This suggests that the more credit lines a borrower has, the more monthly debt the person has to pay.
What if we analyze both borrower rate and credit score range lower together? Is there any relationship between them?
Now, we can see that borrower rate goes lower as the credit score goes higher. This means that users with higher credit score are usually perceived as more credible, and as such, may be more likely to be given a lower borrower rate.
## [1] -0.4615667
R^2 for Current Credit Score vs Borrower Rate is -0.46, which indicates a negative correlation. This suggests that the more creditworthy a borrower is (as shown by credit score), the lower the borrower rate is.
## [1] 0.3604392
R^2 for Stated Monthly Income vs Total Monthly Debt is 0.36, which indicates a positive correlation. This suggests that users with more monthly income have a tendency to take up higher amount of monthly debt.
## [1] -0.3289599
R^2 for Loan Original Amount vs Borrower Rate is -0.33, which indicates a negative correlation. This seems a bit counterintuitive because higher loan usually signifies higher risk, which leads to a higher interest rate. However, in this case, it seems to be the opposite. Probably, there is some lurking variable which correlates with both the variables.
## [1] 0.3408745
R^2 for Current Credit Score vs Borrower Rate is 0.34, which indicates a positive correlation. Same case as “Loan Amount vs Borrower Rate”, it seems a bit counterintuitive because high loan amount should have higher risk of default , and thus, should correlate to lower credit score.
In layman term, if someone borrows a lot of money from you, it is harder to trust that person will pay you back fully as opposed to a lower loan amount.
However, in this case, it seems to be opposite. So, there should be some lurking variable which correlates with both the variables.
As shown by the IQR range in the boxplot of “Income Range vs Loan Original Amount”, borrowers in higher income range tend to have higher loan amount.
In the second boxplot (“Income Range vs Borrower Rate”), we can see that borrowers in higher income range have a lower borrower rate. Meanwhile in the third boxplot (“Income Range vs Credit Score Range Lower”), it shows that borrowers in higher income range have a higher credit score.
As such, the income of borrowers is a more logical variable that leads to the lower borrower rate for a higher loan amount, as well as higher credit score for a higher loan amount.
In the univariate analysis section, we saw that there is a significant spike of borrower rate at around 0.32. We will explore this occurrence in more detail.
## Error in check_breaks_labels(breaks, labels): object 'listingCategory' not found
An in-depth analysis of the spike at BorrowerRate of 0.32 shows that the high interest rate is given mostly for Debt Consolidation in 2011 to 2012. In addtion, the high interest rate mostly covers credit score ranges from 660 to 720.
There is an obvious decrease of median loan amount in 2009 followed by a steady increase of median loan amount afterwards.
Interestingly, in 2012, we can see that the median of the loan amount is much lower than the Q3 of the loan amount. This means that there are more people borrowing at the higher end (50th to 75th percentile) of loan amount as compared to the lower end (25th to 50th percentile) of loan amount.
When we sum up the loan amount by loan year, it tells a slightly different story. In 2009, there is a similar dip of loan amount followed by a steady increase of sum of loan amount. However, in 2014, we can see a significant decrease in the sum of loan amount, which is not clearly depicted in the previous box plot of loan amount by loan year.
## [1] "factor"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NA NA NA "113937"
## Min. 1st Qu. Median Mean 3rd Qu.
## "2005-11-15" "2008-10-02" "2012-06-26" "2011-07-21" "2013-09-18"
## Max.
## "2014-03-12"
In the previous 2 charts, we can see that there is a huge dip in 2009. By analyzing loan amount throughout loan date, we can notice that there is almost no loan happening from November 2008 to July 2009. With a simple search on Prosper Marketplace Wiki, we can understand that there is a cease and desist order on Prosper in November 2008. Prosper resumes its service in July 2009.
The chart above shows a very obvious seasonality trend, in which the peak of median loan amount is at the holiday season (November to December) and beginning of the year (January, February).
## Error in check_breaks_labels(breaks, labels): object 'listingCategory' not found
In a glance, the top 5 listing categories with the highest median loan amount are: - Debt Consolidation - Baby & Adoptions - Wedding Loans - Business - Boat
Some very interesting observations: loan for Baby & Adoption and Wedding have higher loan amount than business. This makes us think twice of whether to get married with sufficient savings (or, we need to take loan, which may not be worthwhile in the long run).
Besides that, people do borrow quite a high amount of money (as shown by median loan amount) to buy boats? I do not really comprehend the logic behind this. Shouldn’t someone have lots of spare cash first before they buy such a luxurious item?
Users who are employed has the highest median loan amount followed by self-employed users and full-time.
Median Loan amount increases as the loan term increases.
The median interest rate is higher for loan term of 3 and 5 years. The variance in interest rate is much bigger than loans in the other 2 loan terms.
Borrowers with loan term of three years have a lower median credit score as compared to borrowers in the two other loan terms.
Monthly income has a positive relationship with monthly debt.
Current credit lines variable has a strong positive relationship with total monthly debt. The more credit lines a user has, the higher the amount of total monthly debt. -Borrower rate
Loan amount has a negative relationship with borrower rate. In addition, loan amount has a positive correlation with credit score. The relationships are counterintuitive because higher loan amount should have higher risk, which leads to higher borrower rate and lower credit score. As such, I suspect there is some lurking variable.
Upon further analysis, we can see that income range has a positive relationship with loan amount (the more income a person is, the higher the loan amount is). Income range is also positively correlated with borrower rate and negatively correlated with credit score. As such, it is more logical to think that income range is the variable affecting the relationship between loan amount and borrower rate / credit score.
The top 5 listing categories with the highest median loan amount are Debt Consolidation, Baby & Adoptions, Wedding Loans, Business, Boat. It seems that marriage really has quite a corelation with debt since “Baby & adoptions” and “Wedding” are in the top 5 listing categories (you get married first, then you have a baby, but you will need lots of money)
The median loan amount is usually higher during holiday seasons (Nov to Feb)
Borrowers who are having employment status as “employed” and “fulltime” have a larger loan amount as compared to the other groups.
The is a strong correlation betwen long term and loan amount, meaning that the longer ther loan term is, the larger the loan amount.
In the year 2008 to 2009, there was a huge decrease in the loan amount. With a more in-depth investigation, the huge dip is due to the cease and desist order by the SEC on the Prosper P2P lending model. This happened from November 2008 to July 2009. After the incident, Prosper reopened agin with a steady increase in the loan mount, with a peak of around $300 million of loans in 2013, and $150 million of loans in 2014.
Credit score and interest rate have the strongest relationship (R^2 = .46). This is reasonable because credit score is a way to assess the “trustworthiness” of borrowers. As such, the more trustworthy the borrowers are, the more likely they will receive a higher credit score.
In addition, total monthly debt and open credit lines are strongly correlated (R^2 = .47). This is probably due to the cumulative effect of multiple loans that creates a huge sum of monthly debt amount.
From the first chart, we can see that the credit score is obviously lower for a lower borrower rate (look at borrower rate >= 0.1). However, if we base on loan amount alone, it is a bit hard to see whether loan amount has any obvious relationship with credit score with this chart.
As such, I created a second chart with stated monthly income on the x-axis, and coloured by loan amount. By looking at borrower rate <= 0.1 and stated monthly income >=$20k, we can see taht the amount of red coluor (high loan amount) is much less as compared to black (low loan amount). This means that if a borrower with high income wants to take a low loan amount, the borrower rate will be more likely to be low.
On the plot above, we can analyze the monthly loan payment to income ratio (monthly loan payment / monthly income). In this case, borrowers in the lower income range have a higher median monthly loan payment to income ratio. This is probably because the monthly income itself is already lower than the rest of the income range. Therefore, it does not mean that the borrowers in the lower income range have have a higher montly loan payment as compared to the rest.
Borrowers in the lowest income range ($1-24,999) have the most fluctuating median monthly debt to income ratio. It has the lowest dip in 2013 and climb up to the highest peak in 2014.
Based on the chart above, borrowers across various income range have debts that are around 20% to 30% of their annual income, with an overall increasing median debt to income ratio. The variance of the debt to income ratio reduces as the income range increases.
By observing the chart above, we can see that in 2013, there is a very significant peak in the sum of loan amount. This occurrence applies to income range of $25k-$50k all the way to $100k+.
Another interesting finding is that borrowers in the income range of $50k-$75k have similar (if not more) sum of loan amount as compared to the higher income range.
As shown by “Loan Amount by Borrower Rate and Credit Score” chart, credit scores seem to influence the interest rate of borrowers. The better the credit score, the lower the interest rate.
As shown by “Monthly Income by Borrower Rate and Loan Amount” chart, loans that are lower risk (low loan amount and high income borrower) are more likely to have lower interest rate.
Based on “Monthly Loan Payment to Income Ratio by Loan Year Faceted by Income Range” chart, Borrowers in the lower income range seem to have higher monthly payment to income ratio. This is probably due to their lower income.
Based on “Debt to Income Ratio by Loan Year Faceted by Income Range” chart, borrowers across various income range have debts that are around 20% to 30% of their annual income, with an overall increasing median debt to income ratio.
Based on “”Loan Amount by Year Faceted by Income Range“, there is a very significant peak in the sum of loan amount in 2013. This occurrence applies to income range of $25k-$50k all the way to $100k+.
By looking at the loan amount by income range and year, we can see that borrowers in the income range of $50k-$75k have similar (if not more) sum of loan amount as compared to the higher income range.
By analyzing Debt to income ratio and Monthly loan payment to income ratio, we can see that the higher the income, the lower percentage of debt and monthly payment as compared to the income.
As shown by the chart of debt-to-income ratio across the years, we can observe that the Prosper platform seems to allow more borrowers with higher loan amount, particular for borrower range in the income range of $50-75k, $75-100k and $100k+.
## LoanYear LoanTotal LoanCount YoYLoanAmountGrowth YoYLoanCountGrowth
## 1 2005 78687 22 NA NA
## 2 2006 28132199 5906 356.5202892 267.454545455
## 3 2007 80787786 11460 1.8717196 0.940399594
## 4 2008 69561850 11552 -0.1389559 0.008027923
## 5 2009 8914396 2047 -0.8718494 -0.822801247
## 6 2010 26940486 5652 2.0221325 1.761113825
## 7 2011 75138013 11228 1.7890370 0.986553432
## 8 2012 153175116 19553 1.0385835 0.741449947
## 9 2013 362170278 34345 1.3644198 0.756507953
## 10 2014 144995536 12172 -0.5996482 -0.645596157
The graph shows the distribution and the sum of amount of loans by year. If we look at the year 2010 to 2013, after the SEC cease and desist order, there is a substantial bounce back of loan amount from 2010 to 2013. In addition, as shown in the boxplot, the median and mean of loan amount also both increased substantially after 2012.
As shown in YearlyLoanDistribution table, in 2010 the number of loans increased 176% year over year, in 2011 it increased 99%, in 2012 74% and in 2013 76%. This graph shows the distribution of loan amounts by year. The dollar amount of loans increased 202%, 179%, 104%, 136%, year over year respectively.
From the figures above, we can see that the overall risk of running the business is increasing throughout the year as shown by the higher number of loans with lower credit score across the years. In addition, the median and average borower rate fluctuate after after the SEC cease and desist order (after 2009). The median and average borrower rate gradually decreases (on the lower credit range) throughout 2011 to 2014. The IQR of the interest rate is also reduced greatly in 2013 and 2014 (More borrowers are in a closer borrower rate to each other). Probably, the move to lower the interest rate for lower credit score range is an attempt to increase the number of loans.
## Error in eval(expr, envir, enclos): object 'listingCategory' not found
## Error in check_breaks_labels(breaks, labels): object 'cats' not found
I categorize the bar chart above by loan listing category throughout the year. It looks like that there is no loan category before 2006. In addition, it seems that debt consolidation category dominated the loan category throughout the years. Year 2013 is the best year for prosper with over $350M loans. Within only the first 3 months of 2014, the platform already produced nearly $150M loans. This seems like it is going to hit another record in year 2014.
There is almost 114,000 loans from November 2005 to March 2014 in this dataset. Over the course of those years, the Prosper platform has made around $950 million in total.
The biggest difficulty in this dataset is the huge number of variables (81 variables), in which I did not know how to start analyzing this dataset. As such, I look at the data dictionary given to roughly estimate which variables are going to be useful for the data analysis process. Afterwards, I subsetted the full data set with only the variables that I wanted to analyze. Subsequently, I went through those variable to check if they would help to understand about the Prosper platform better. This includes understanding how the credit score works, how much users earn monthly, etc.
Analysing from the data, we can see that Prosper has quite interesting business trend. Initially, it struggled to get more users, and then subsequently got hit by cease and decist order by the SEC in 2009. I would say that they are very persistent as it continued the business after the cease and decist order and bounced back with an even better business after that.
This is an interesting data set and it has helped me to be more confident in analyzing data sets with huge number of variables.